# Zero-shot Inference
## Devstral Small Vision 2505 GGUF
Devstral Small with a vision encoder based on the Mistral Small model; supports image-text generation tasks and is compatible with the llama.cpp framework.
Image-to-Text · Apache-2.0 · by ngxson · 777 downloads · 20 likes

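A minimal sketch of loading a GGUF vision model through llama-cpp-python (the Python bindings for llama.cpp). The file names are placeholders, and whether the LLaVA-style chat handler matches this model's vision projector is an assumption to verify against the model card.

```python
# Hedged sketch: paths are placeholders; confirm the projector/handler
# pairing on the model card before relying on this.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(clip_model_path="mmproj-f16.gguf")  # vision projector file
llm = Llama(
    model_path="devstral-small-vision-2505-Q4_K_M.gguf",  # hypothetical quant filename
    chat_handler=chat_handler,
    n_ctx=4096,  # leave room for image tokens plus the text prompt
)
result = llm.create_chat_completion(messages=[{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "file:///path/to/image.png"}},
        {"type": "text", "text": "Describe this image."},
    ],
}])
print(result["choices"][0]["message"]["content"])
```
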
## Google.medgemma 4b It GGUF
MedGemma-4B-IT is a medical-focused image-to-text generation model developed by Google.
Image-to-Text · by DevQuasar · 6,609 downloads · 1 like

## VL Rethinker 7B 8bit
VL-Rethinker-7B-8bit is a multimodal model based on Qwen2.5-VL-7B-Instruct, supporting visual question answering tasks.
Image-to-Text · Transformers · English · Apache-2.0 · by mlx-community · 21 downloads · 0 likes

## VL Rethinker 7B Fp16
A multimodal vision-language model converted from Qwen2.5-VL-7B-Instruct, supporting visual question answering tasks.
Image-to-Text · Transformers · English · Apache-2.0 · by mlx-community · 17 downloads · 0 likes

## Qwen2.5 VL 32B Instruct GGUF
Qwen2.5-VL-32B-Instruct is a multimodal vision-language model that supports joint understanding and generation tasks over images and text.
Image-to-Text · English · Apache-2.0 · by samgreen · 25.59k downloads · 6 likes

## Qwen2.5 VL 72B Instruct GGUF
Qwen2.5-VL-72B-Instruct is a multimodal vision-language model that supports interactive generation tasks involving images and text.
Image-to-Text · English · Other license · by samgreen · 2,073 downloads · 1 like

## ARPG
ARPG is an autoregressive image generation framework that achieves BERT-style masked modeling within a GPT-style causal architecture.
Image Generation · MIT · by hp-l33 · 68 downloads · 2 likes

## Qwen2.5 14B CIC ACLARC
A citation intent classification model fine-tuned from Qwen 2.5 14B Instruct, designed specifically for classifying citation intent in scientific publications.
Text Classification · Transformers · English · Apache-2.0 · by sknow-lab · 24 downloads · 2 likes

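Since the entry names the base model and task, a usage sketch may help; the repo id and prompt template below are assumptions, and the authors' exact training template should be taken from the model card.

```python
# Hedged sketch: repo id and prompt wording are assumptions, not confirmed.
from transformers import pipeline

classifier = pipeline("text-generation", model="sknow-lab/Qwen2.5-14B-CIC-ACLARC")
citation = "We adopt the training recipe of Smith et al. (2020)."
prompt = f"Classify the intent of this citation: {citation}\nIntent:"
print(classifier(prompt, max_new_tokens=10)[0]["generated_text"])
```
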
## Eagle2 1B
Eagle 2 is a family of high-performance vision-language models that emphasizes transparency in its data strategy and training recipes, aiming to help the open-source community develop competitive vision-language models.
Image-to-Text · Transformers · Other · by nvidia · 1,791 downloads · 23 likes

## Aim Xlarge
AiM is an unconditional image generation model based on PyTorch, published to the Hugging Face Hub via PytorchModelHubMixin.
Image Generation · MIT · by hp-l33 · 23 downloads · 5 likes

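The PytorchModelHubMixin mentioned above is a generic huggingface_hub mechanism, so a toy sketch can show how it works; the class below is a stand-in, not AiM's actual architecture.

```python
# Toy sketch of PyTorchModelHubMixin: a plain nn.Module gains
# save_pretrained / from_pretrained / push_to_hub by inheriting the mixin.
import torch
import torch.nn as nn
from huggingface_hub import PyTorchModelHubMixin

class TinyGenerator(nn.Module, PyTorchModelHubMixin):
    def __init__(self, latent_dim: int = 16):
        super().__init__()
        self.net = nn.Linear(latent_dim, 3 * 8 * 8)  # toy "image" decoder

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z).view(-1, 3, 8, 8)

model = TinyGenerator()
model.save_pretrained("tiny-generator")              # writes config.json + weights
restored = TinyGenerator.from_pretrained("tiny-generator")
# model.push_to_hub("your-username/tiny-generator")  # optional: upload to the Hub
z = torch.randn(1, 16)
print(restored(z).shape)  # torch.Size([1, 3, 8, 8])
```
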
## Minicpm Llama3 V 2 5 GGUF
MiniCPM-Llama3-V-2_5 is a multimodal visual question answering model based on the Llama3 architecture, supporting interaction in both Chinese and English.
Image-to-Text · Supports Multiple Languages · by gaianet · 112 downloads · 3 likes

## Depth Anything V2 Metric Indoor Large Hf
A fine-tuned version of Depth Anything V2 for indoor metric depth estimation, trained on the synthetic Hypersim dataset and compatible with the transformers library.
3D Vision · Transformers · by depth-anything · 47.99k downloads · 9 likes

## Depth Anything V2 Metric Indoor Base Hf
A version of Depth Anything V2 fine-tuned for indoor metric depth estimation on the synthetic Hypersim dataset.
3D Vision · Transformers · by depth-anything · 9,056 downloads · 1 like

## Depth Anything V2 Metric Indoor Small Hf
A model fine-tuned from Depth Anything V2 for indoor metric depth estimation, trained on the synthetic Hypersim dataset and compatible with the transformers library.
3D Vision · Transformers · by depth-anything · 750 downloads · 2 likes

## Depth Anything V2 Metric Outdoor Small Hf
A fine-tuned version of Depth Anything V2 for metric depth estimation in outdoor scenes, trained on the synthetic Virtual KITTI dataset.
3D Vision · Transformers · by depth-anything · 459 downloads · 1 like

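All four Depth Anything V2 variants above share the same transformers interface, so one sketch covers them; the repo id follows the naming visible here and should be checked against the depth-anything organization page.

```python
# Hedged sketch: repo id assumed from the entry names above.
from PIL import Image
from transformers import pipeline

depth = pipeline("depth-estimation",
                 model="depth-anything/Depth-Anything-V2-Metric-Indoor-Large-hf")
image = Image.open("room.jpg")          # placeholder path
result = depth(image)
result["depth"].save("room_depth.png")  # depth map rendered as a PIL image
print(result["predicted_depth"].shape)  # raw per-pixel depth tensor
```
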
## Chronos T5 Base
Chronos is a family of pretrained time series forecasting models based on language model architectures: time series are transformed into token sequences via scaling and quantization, and the models are trained on those sequences.
Time Series Forecasting · Transformers · Apache-2.0 · by autogluon · 82.42k downloads · 5 likes

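The quantize-and-forecast recipe in the description maps to a few lines with the chronos-forecasting package; the repo id is assumed to match the entry above.

```python
# Hedged sketch: pip install chronos-forecasting; repo id assumed.
import torch
from chronos import ChronosPipeline

pipe = ChronosPipeline.from_pretrained("autogluon/chronos-t5-base")
context = torch.tensor([112.0, 118.0, 132.0, 129.0, 121.0, 135.0, 148.0, 148.0])
forecast = pipe.predict(context, prediction_length=4)  # [series, samples, horizon]
median = forecast[0].quantile(0.5, dim=0)  # median across sampled trajectories
print(median)
```
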
## Blip2 Test
BLIP-2 is a vision-language model based on OPT-2.7b that performs image-to-text generation by freezing the image encoder and the large language model while training a querying transformer (Q-Former).
Image-to-Text · Transformers · English · MIT · by advaitadasein · 18 downloads · 0 likes

## Blip2 Flan T5 Xxl
BLIP-2 is a vision-language model that combines an image encoder with the Flan T5-xxl large language model for image-to-text tasks.
Image-to-Text · Transformers · English · MIT · by Salesforce · 6,419 downloads · 88 likes

## Blip2 Flan T5 Xl
BLIP-2 is a vision-language model based on Flan T5-xl, pretrained with a frozen image encoder and frozen large language model; it supports tasks such as image captioning and visual question answering.
Image-to-Text · Transformers · English · MIT · by Salesforce · 91.77k downloads · 68 likes

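Both Salesforce BLIP-2 checkpoints above load through the standard transformers BLIP-2 classes; a minimal visual question answering sketch (the image path is a placeholder):

```python
# Minimal BLIP-2 VQA sketch.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xl")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xl", torch_dtype=torch.float16, device_map="auto"
)
image = Image.open("photo.jpg")  # placeholder path
inputs = processor(images=image,
                   text="Question: what is in the photo? Answer:",
                   return_tensors="pt").to(model.device, torch.float16)
out = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```
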
## Flan T5 Xl
FLAN-T5 XL is an instruction-finetuned language model based on the T5 architecture; fine-tuning on more than 1,000 tasks markedly improves its multilingual and few-shot performance.
Large Language Model · Supports Multiple Languages · Apache-2.0 · by google · 257.40k downloads · 494 likes

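Instruction-finetuned T5 models need no task-specific head; a minimal zero-shot sketch using the repo id from the entry above:

```python
# Minimal FLAN-T5 XL sketch: zero-shot instruction following.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl", device_map="auto")
inputs = tokenizer("Translate to German: How old are you?",
                   return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
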
## Gpt2 Question Answering Squad2
A question-answering model based on the GPT-2 architecture, fine-tuned on the SQuAD2 dataset to answer questions about a given text.
Question Answering System · Transformers · by danyaljj · 16 downloads · 2 likes

## Monot5 Base Msmarco
A re-ranking model based on the T5-base architecture, fine-tuned for 100,000 steps on the MS MARCO passage dataset; suited to document re-ranking tasks in information retrieval.
Large Language Model · by castorini · 7,405 downloads · 11 likes

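monoT5 scores a query-document pair by reading "Query: ... Document: ... Relevant:" and comparing the probabilities of generating "true" versus "false" as the first output token. A sketch of that recipe follows; the template is taken from the published monoT5 setup and is worth verifying against the model card.

```python
# Hedged sketch of monoT5 relevance scoring.
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("castorini/monot5-base-msmarco")
model = T5ForConditionalGeneration.from_pretrained("castorini/monot5-base-msmarco")
true_id = tokenizer.encode("true")[0]    # first token id of "true"
false_id = tokenizer.encode("false")[0]  # first token id of "false"

def relevance(query: str, document: str) -> float:
    text = f"Query: {query} Document: {document} Relevant:"
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    start = torch.tensor([[model.config.decoder_start_token_id]])
    logits = model(**inputs, decoder_input_ids=start).logits[0, 0]
    probs = torch.softmax(logits[[false_id, true_id]], dim=0)
    return probs[1].item()  # probability mass on "true"

docs = ["Paris is the capital of France.", "Bananas are rich in potassium."]
ranked = sorted(docs, key=lambda d: relevance("capital of France", d), reverse=True)
print(ranked[0])
```
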
## Deberta V3 Base Mnli
DeBERTa-v3 model trained on the MultiNLI dataset for natural language inference tasks, excelling in zero-shot classification scenarios.
Text Classification · Transformers · English · by MoritzLaurer · 14.53k downloads · 6 likes

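This entry matches the page's zero-shot theme directly; a minimal sketch with the transformers zero-shot classification pipeline (repo id assumed from the entry, check the author's page for the exact name):

```python
# Hedged sketch: NLI-based zero-shot classification.
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="MoritzLaurer/DeBERTa-v3-base-mnli")
result = classifier(
    "The new GPU doubles throughput at the same power draw.",
    candidate_labels=["hardware", "politics", "cooking"],
)
print(result["labels"][0], result["scores"][0])
```
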
## Zeroaraelectra
A zero-shot classification model for Arabic, supporting natural language inference tasks.
Text Classification · Transformers · Supports Multiple Languages · Other license · by KheireddineDaouadi · 39 downloads · 0 likes